Do it yourself

导个包先

requests 请求库和 lxml XML 解析库

实际上只需要 lxml 中的 etree 模块，所以使用 from ... import ...

python

import requests
from lxml import etree

通过查阅 requests 的官方文档我们知道，可以像这样发送一个 GET 请求，来获取到网页的源代码

这里手动配置了请求头 headers 以初步模拟浏览器发出的请求，避免被服务器轻易识别。关于这部分知识，参见 MDN Docs

大部分网站都会配备反爬虫机制，它们会通过检查你的请求参数、统计你的请求频率等等方法来决定是否阻止你的请求（更严重地，封禁你的 IP / 账号等）
这里有一个工具网站 https://httpbin.org/headers ，它会将你发送给服务器的请求头作为响应内容发回给你。尝试不配置请求头去请求这个地址，看看 requests 的默认请求头是什么样的；再试着用浏览器访问，看看正常的浏览器的请求头又是什么样的。

python

url = "https://www.bilibili.com/video/BV1iW421R71g"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}
resp = requests.get(url, headers=headers)
source = resp.text

现在 source 中存储着网页的源代码，我们使用 lxml 库对其进行解析

手动指定 parser 是为了更宽容地处理 HTML，否则遇到写法不太规范的网页会抛出错误

python

html: etree.ElementBase = etree.fromstring(
    source, parser=etree.HTMLParser()
)

还记得我们先前记下来的 XPath 吗，该派上用场了。使用它定位网页中包含封面 URL 的元素。

不建议总是依赖浏览器给出的 XPath ，有时自己根据网页结构写出的 XPath 更具有通用性。（前提是没写错）

python

xpath = '//meta[@itemprop="image"]'
element: etree.ElementBase = html.xpath(xpath)[0]
print(element.attrib)
cover_url = element.attrib.get("content")

Output

txt

{'data-vue-meta': 'true', 'itemprop': 'image', 'content': '//i0.hdslb.com/bfs/archive/2ff337f507754e0d172298d3e9b815413b4a63b7.jpg@100w_100h_1c.png'}

对得到的封面 URL 做一些处理，加上协议名，移除调整参数之类的

python

cover_url = "https:"+cover_url
cover_url = cover_url.split("@")[0]
print(cover_url)

Output

txt

https://i0.hdslb.com/bfs/archive/2ff337f507754e0d172298d3e9b815413b4a63b7.jpg

成功得到了最终的 URL。

但是这还不够，我们需要将它封装为一个函数（方便复用），并做一些优化。

让我们回到先前的文档中。

2.1.1 Book

2.1.2 Lab

2.1.3 Project

2.2.1 Python Grammar

2.2.2 Project Perspective

2.2.2.1 Proj-00

2.2.2.2 Proj-01

2.2.2.3 Proj-02

2.2.2.4 Proj-03

2.2.2.5 Proj-04

2.2.3 CS Intro

2.2.3.1 Building Abstraction with Functions

Do it yourself

2.2.2.1 Proj-00

2.2.2.2 Proj-01

2.2.2.3 Proj-02

2.2.2.4 Proj-03

2.2.2.5 Proj-04

2.2.3.1 Building Abstraction with Functions

Do it yourself ​

Do it yourself